ABSTRACT

The contention is made that group performance data are useful
in the construction and interpretation of criterion-referenced tests.
The Mastery Learning Test Model, which was developed for analyzing 
criterion-referenced test data, is described. An estimate of the 
proportion of students in an instructional group having achieved the 
referent objective is usable as a prior probability in interpreting 
individual responses. Considering instructional group performance 
enhances estimates of individual performance. Correlational data 
from a set of test items and a representative population of students 
are used to estimate the required item parameters. 
The proper use of norm-group data, both for the construction and
the application of criterion-referenced tests, is an issue needing
resolution. Typically "criterion-referenced" is defined in relation
to norm-referenced (or standardized) tests. Livingston (1972) states
that "norm-referenced measures compare the student's performance with
the mean of a norm group whereas criterion-referenced measures compare
his performance with a specified criterion score." On the basis of such
definitions, Airasian and Madaus (1972) conclude that "the
interpretation of a student's performance in a criterion-referenced
situation is absolute and axiomatic, not dependent upon how other
learners perform." Block (1971) observes that criterion-referenced
"measurements are absolute indices designed to indicate what the pupil
has or has not learned from a given instructional segment. The
measurements are absolute in that they are interpretable solely
vis-a-vis a fixed performance standard to criterion and need not be
interpreted relative to other measurements."

These statements do not clarify the legitimacy or the value of 
norms in interpreting individual performance; they have led some to 
question the appropriateness of using any item selection procedure based 
on norm-group responses. It is contended here that norm-group perfor- 
mance is useful and legitimate information for both the construction 
and application of criterion-referenced tests. 

A criterion-referenced test is here defined as a set of items
sampled from a domain which has been judged to be an adequate
representation of an instructional objective. This definition does not
limit criterion-referenced tests to narrowly defined behavioral
objectives for which an item form (Osburn, 1968) specifies how to
generate every item in the domain. It is desirable, however, that the
domain be described in operational terms; using this description,
another test developer should be able to generate an equivalent domain
of test items. The assumptions or theory relating the domain of items
to the referent objective should be explicitly stated.

Procedures for selecting a sample of items from a domain depend
upon the intended application of the test. One application of a
criterion-referenced test is to estimate the proficiency of individual
students relative to some achievement continuum (Kriewall, 1972).
This appears to have been Glaser's (1963) original conception of the
purpose of a criterion-referenced test, where he assumed that
"Underlying the concept of achievement measurement is the notion
of a continuum of knowledge acquisition ranging from no proficiency
at all to perfect performance." For applications where hand scoring
of tests is used, a random or stratified random sampling of items
from the domain permits the unweighted number of correct responses
to be interpreted as a degree-of-proficiency measure. If computer
scoring is used, a sample of highly discriminating items will yield
a better estimate of proficiency. Thus, the rejection of sampling
based on item discrimination indices (norm-group performance) is based
on the assumptions that a degree-of-proficiency measure is required and
that the test must be hand scored.

A frequent application of criterion-referenced tests is the
making of categorical mastery/non-mastery decisions for students
comprising an instructional group. Subsequent instruction for a
student is contingent upon the category in which he is placed.
Typically, test developers have computed a degree-of-proficiency index
and then selected a critical "passing" score, most frequently on an
arbitrary basis. A problem that arises is that it is difficult,
perhaps impossible, to define a meaningful degree-of-proficiency index
for many types of legitimate instructional objectives. Ebel (1971)
concludes that "criterion-referenced measurement may be practical
in those few areas of achievement which focus on cultivation of a
high degree of skill in the exercise of a limited number of abilities."
Ebel's conclusion is based on the premise that a degree-of-proficiency
scale "anchored at the extremities — a score at the top of the scale 
indicating complete or perfect mastery of some defined abilities; 
one at the bottom indicating complete absence of those abilities" is 
required. Fortunately, such a measurement scale is not needed for 
the categorical decision application. 



THE MASTERY LEARNING TEST MODEL 

The Mastery Learning Test Model has been designed to provide
an appropriate algorithm for analyzing criterion-referenced test data
for making the following instructional decision: "which students have
achieved the referent objective." Two statistics are computed: the
probability that a given student has achieved the objective and the
proportion of an instructional group that has achieved the objective.
The model assumes that each student in an instructional group can be
treated as belonging to one of two groups: a group that has achieved
the objective or one that has failed to achieve it. The two-state
assumption does not deny the possibility of partial achievement of the
objective. It does imply that categorization of students into two
groups, masters and non-masters, is the desired type of decision and
the basis for subsequent instruction.

The Mastery Learning Test Model and the true score theory upon
which it is based are derived in an earlier paper (Besel, 1972).
This model is related to a simpler mastery testing model suggested
by Emrick (1971). Emrick's model assumes that measurement error can
be accounted for by two test parameters: $\alpha$, the probability
that a non-master will give a correct answer to an item; and $\beta$,
the probability that a master will give an incorrect answer to an
item. His model implicitly assumes that all item difficulties and
inter-item correlations are equal. This assumption can be avoided by
increasing the number of test parameters, either by permitting item
$\alpha$ parameters, or item $\beta$ parameters, or both.

PROBABILITY OF MASTERY ESTIMATION 
Let $x_{ij}$ represent the response of individual j to item i:

$$x_{ij} = \begin{cases} 1 & \text{if a correct response is given} \\ 0 & \text{if an incorrect response is given} \end{cases} \tag{1}$$

$\alpha_i$ = the probability that an individual in the non-mastery
state ($\bar{M}$) will give a correct response to the ith item.

$\beta_i$ = the probability that an individual in the mastery state
(M) will give an incorrect response to the ith item.

Using X to represent a response vector for a K-item test, Bayes'
formula can be used to estimate the conditional probability of mastery:

$$P(M \mid X) = \frac{\mathrm{PRM} \prod_{i=1}^{K} P(x_i \mid M)}{\mathrm{PRM} \prod_{i=1}^{K} P(x_i \mid M) + [1 - \mathrm{PRM}] \prod_{i=1}^{K} P(x_i \mid \bar{M})} \tag{2}$$



where PRM is the prior probability of the mastery state. The j subscript 
was deleted to simplify notation. The denominator of equation (2) repre- 
sents the prior probability of the response vector X. 
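
The Bayes computation in equation (2) can be illustrated with a short
Python sketch. It is not the author's program: the function name, the
argument layout, and the example parameter values (a guessing rate of
1/3 for non-masters on three-option items and a 0.1 slip rate for
masters) are assumptions made only for illustration, and the items are
assumed to be conditionally independent given the mastery state.

```python
from math import prod


def probability_of_mastery(responses, alphas, betas, prior):
    """Posterior P(M | X) following equation (2).

    responses: list of 0/1 item scores for one student.
    alphas:    per-item probability that a non-master answers correctly.
    betas:     per-item probability that a master answers incorrectly.
    prior:     PRM, the prior probability of the mastery state.
    """
    # Likelihood of the response vector if the student is a master.
    like_master = prod(
        (1 - b) if x == 1 else b for x, b in zip(responses, betas)
    )
    # Likelihood of the response vector if the student is a non-master.
    like_nonmaster = prod(
        a if x == 1 else (1 - a) for x, a in zip(responses, alphas)
    )
    numerator = prior * like_master
    return numerator / (numerator + (1 - prior) * like_nonmaster)


if __name__ == "__main__":
    # Hypothetical 5-item test: four correct responses, one incorrect.
    p = probability_of_mastery(
        [1, 1, 1, 1, 0], [1 / 3] * 5, [0.1] * 5, prior=0.5
    )
    print(f"P(M | X) = {p:.3f}")
```

With per-item parameters the same function handles unequal item
difficulties, which is the generalization of Emrick's model described
earlier.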



ESTIMATING THE PROPORTION OF STUDENTS IN THE MASTERY STATE 
Let $E(x_i)$ represent the expected value of $x_i$ for a sample
population of N students.

For an item with parameters $(\alpha_i, \beta_i)$,

$$E_M(x_i) = 1 - \beta_i \quad \text{for the } N_M \text{ individuals in the mastery state,} \tag{3}$$

$$E_{\bar{M}}(x_i) = \alpha_i \quad \text{for the } (N - N_M) \text{ individuals in the non-mastery state.} \tag{4}$$

Then,

$$E(x_i) = \frac{1}{N} \left[ N_M (1 - \beta_i) + (N - N_M)\, \alpha_i \right] \tag{5}$$

Define proportion in mastery to be:

$$MP = \frac{N_M}{N} \tag{6}$$

An unbiased estimate of $E(x_i)$ is the proportion of students
($PC_i$) in the sample which gave a correct response to the ith item.

Let GMP symbolize an estimate of the proportion in mastery, MP. Then,

$$PC_i = GMP\, (1 - \beta_i) + (1 - GMP)\, \alpha_i \tag{7}$$

Solving for GMP yields

$$GMP = \frac{PC_i - \alpha_i}{1 - \alpha_i - \beta_i} \tag{8}$$

Since each item was assumed to be a measure of the same objective, the
proportion in mastery, MP, for each item, or for a K-item test, must be
equivalent. The GMP estimate for a K-item test can be shown to be

$$GMP = \frac{U/K - \bar{\alpha}}{1 - \bar{\alpha} - \bar{\beta}} \tag{9}$$

where

$U = \sum_{i=1}^{K} PC_i$ is the test mean score;
$\bar{\alpha}$ is the average of the $\alpha_i$;
$\bar{\beta}$ is the average of the $\beta_i$.
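
As an illustration of equation (9), the sketch below computes the
proportion-in-mastery estimate from item proportions correct. The
function name and the example values are hypothetical. Clipping the
result to the interval from 0 to 1 is an added assumption, since
sampling error can push the raw estimate outside that range.

```python
def estimate_gmp(item_proportions_correct, alphas, betas):
    """Proportion-in-mastery estimate for a K-item test, equation (9)."""
    k = len(item_proportions_correct)
    mean_pc = sum(item_proportions_correct) / k  # U / K
    mean_alpha = sum(alphas) / k                 # average non-master success rate
    mean_beta = sum(betas) / k                   # average master error rate
    denominator = 1 - mean_alpha - mean_beta
    if denominator <= 0:
        raise ValueError("items do not discriminate: 1 - alpha - beta <= 0")
    gmp = (mean_pc - mean_alpha) / denominator
    # Assumption: clip to a valid proportion; the source does not say how
    # out-of-range estimates were handled.
    return min(1.0, max(0.0, gmp))


if __name__ == "__main__":
    # Hypothetical three-item test.
    alphas = [0.30, 0.35, 0.40]
    betas = [0.10, 0.05, 0.10]
    pcs = [0.66, 0.71, 0.70]
    print(f"GMP = {estimate_gmp(pcs, alphas, betas):.3f}")
```

Because equation (7) is linear in the item parameters, averaging it
over the K items gives equation (9) directly.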

PRIOR PROBABILITIES BASED ON COLLATERAL INFORMATION 

If mastery decisions are based upon responses to a small set of 
items sampled from a domain, it is likely that many errors of classi- 
fication will be made. One way of obtaining more information on each 
examinee without requiring the administration of additional test items 
is to use the collateral information contained in the test data of 
other students (Hambleton and Novick, 1973). The proportion in mastery 
estimate computed using equation (9) can be used as a prior probability 
estimate. Group-based priors may increase accuracy to an extent equivalent 
to adding between 6 and 25 items to a test as short as 5 items (Novick, 
Lewis, and Jackson, 1973). While the use of group-estimated priors is 
somewhat controversial for selection decisions across instructional groups 
(Novick, 1970), it promises to enhance instructional decisions within an 
instructional group. 
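
The effect of a group-based prior on one student's result can be
sketched as follows. This is an illustration only: it assumes every
item shares the same non-master success rate and master error rate (a
simplification of the model), and the group labels, priors, and
parameter values are hypothetical.

```python
def posterior_from_score(score, n_items, alpha, beta, prior):
    """Posterior probability of mastery given the number of correct answers.

    Assumes a common alpha (non-master success rate) and beta (master
    error rate) for all items, so only the score matters.
    """
    n_wrong = n_items - score
    like_master = (1 - beta) ** score * beta ** n_wrong
    like_nonmaster = alpha ** score * (1 - alpha) ** n_wrong
    numerator = prior * like_master
    return numerator / (numerator + (1 - prior) * like_nonmaster)


def posterior_by_group(score, n_items, alpha, beta, group_priors):
    """Same score, interpreted against each group's proportion in mastery."""
    return {
        group: posterior_from_score(score, n_items, alpha, beta, prior)
        for group, prior in group_priors.items()
    }


if __name__ == "__main__":
    priors = {"low-mastery group": 0.2, "mixed group": 0.5,
              "high-mastery group": 0.8}
    for group, p in posterior_by_group(3, 5, 1 / 3, 0.1, priors).items():
        print(f"{group}: P(mastery | 3 of 5 correct) = {p:.3f}")
```

The same score of 3 out of 5 is weak evidence of mastery in a group
where few students have achieved the objective and much stronger
evidence in a group where most have.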



The probability-of-mastery measure is ideally suited for a
decision-theoretic approach to selecting a cutting score for the
mastery decision application. If $L_1$ and $L_2$ are used to represent
the losses associated with false-fail and false-pass
misclassifications, respectively, the appropriate cutting score on the
probability-of-mastery measure can be shown to be:

$$\frac{L_2}{L_1 + L_2} \tag{10}$$

Only the ratio of $L_2$ to $L_1$ need be specified to derive a cutting
score. If the proportion in mastery is used as the prior probability,
the number of correct responses needed to reach this cutting score
will decrease as the proportion-in-mastery estimate increases.
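
One way to see where such a cutting score comes from is a short
expected-loss argument. The following is a sketch under stated
assumptions: the first loss is taken to be the false-fail loss and the
second the false-pass loss, correct classifications carry no loss, and
$P$ is shorthand introduced here for the probability of mastery
$P(M \mid X)$ from equation (2).

```latex
\begin{align*}
\text{expected loss if the student is passed} &= L_2\,(1 - P), \\
\text{expected loss if the student is failed} &= L_1\,P, \\
\text{pass when } L_2\,(1 - P) \le L_1\,P
  &\iff P \ge \frac{L_2}{L_1 + L_2}
   = \frac{L_2/L_1}{1 + L_2/L_1} \\
  &\iff \frac{\prod_{i=1}^{K} P(x_i \mid M)}
             {\prod_{i=1}^{K} P(x_i \mid \bar{M})}
   \ge \frac{L_2}{L_1} \cdot \frac{1 - \mathrm{PRM}}{\mathrm{PRM}}.
\end{align*}
```

Under these assumptions the threshold depends only on the loss ratio,
and the right-hand side of the last line shrinks as PRM grows, so a
higher group prior lets fewer correct responses carry a student past
the cutoff.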



PARAMETER ESTIMATION 

Both the $\alpha$ and $\beta$ item parameters can be estimated from
the item response data collected from a representative sample of
students. Two parameter estimation algorithms have been developed for
a Mastery Learning Model which has a single test $\beta$ parameter and
item $\alpha$ parameters.* Least-squares estimates of the parameters
are computed using three classes of empirical data:

1. Item difficulties

2. Inter-item covariances

3. Score histograms

* Computer program listings are available from the author upon
request.

The first algorithm computes the least-squares estimates using
an independent estimate of the proportion of students that have achieved
the referent objective (GMP). The second algorithm requires no input
estimate of GMP: it is estimated from the data in addition to the
$\alpha$ and $\beta$ parameters.
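
The author's program listings are not reproduced here, so the
following is only a simplified sketch of what a fixed-GMP least-squares
fit could look like, not the algorithm actually used. It assumes a
single test-level master error rate, item-level non-master success
rates, and independence of item responses within each mastery state.
It matches item difficulties exactly and fits the inter-item
covariances by a grid search, and it ignores score histograms. All
names are hypothetical.

```python
def fit_fixed_gmp(pc, cov, gmp, grid_steps=500):
    """Grid-search least-squares fit with GMP held fixed.

    pc:  list of item proportions correct (item difficulties).
    cov: K x K nested list of inter-item covariances.
    gmp: independent estimate of the proportion in mastery (0 < gmp < 1).
    Returns (beta, alphas).
    """
    k = len(pc)
    best = None
    for step in range(grid_steps):
        beta = 0.5 * step / grid_steps  # candidate master error rate
        # Solve PC_i = GMP(1 - beta) + (1 - GMP) alpha_i for each alpha_i.
        alphas = [(p - gmp * (1 - beta)) / (1 - gmp) for p in pc]
        if any(a < 0 or a > 1 for a in alphas):
            continue  # skip candidates giving impossible probabilities
        sse = 0.0
        for i in range(k):
            for j in range(i + 1, k):
                # Covariance implied by a two-state mixture.
                model_cov = (gmp * (1 - gmp)
                             * (1 - alphas[i] - beta)
                             * (1 - alphas[j] - beta))
                sse += (cov[i][j] - model_cov) ** 2
        if best is None or sse < best[0]:
            best = (sse, beta, alphas)
    if best is None:
        raise ValueError("no admissible parameter values on the grid")
    return best[1], best[2]
```

A version without a fixed GMP would have to search over GMP as well,
which leaves more parameters to be determined from the same data.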

The stability of the parameter estimates was evaluated, for each
algorithm, using test data from the end-of-unit criterion exercises of
the SWRL Beginning Reading Program. Data from two consecutive years
(1970-71 and 1971-72) were sampled from schools participating in the
quality assurance tryout of the SWRL reading program. Each criterion
exercise measured the achievement of four program objectives: (1)
words in a storybook, (2) word elements, (3) word attack (novel words),
and (4) letter names. Five three-option multiple-choice items were
used for each objective. Data from all 10 units of the program were
analyzed; the sample sizes shrank from 263 to 98 for the first year and
from 418 to 173 for the second year.

The means and variances of the differences between the parameter
estimates for the two years were examined (see Table 1). Computations
were made for item $\alpha$, average $\alpha$ ($\bar{\alpha}$), and
test $\beta$. For the "fixed-GMP" algorithm two estimates of GMP were
used. The first estimate was the proportion of students scoring 80%
(4 right out of 5) or better for the outcome. The second estimate was
the proportion with a perfect score. The item $\alpha$ differences are
based on 50 items, average $\alpha$ and test $\beta$ on 10 tests. The
mean differences could be due partially to systematic differences in
the student populations since different school districts were
represented in the two samples. The variances are more appropriate
estimates of parameter stability.
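
The stability summary described above can be sketched in a few lines
of Python. The direction of the difference (second year minus first)
and the use of the sample variance are assumptions, since the source
does not specify either, and the example estimates are hypothetical.

```python
from statistics import mean, variance


def stability(first_year, second_year):
    """Mean and variance of year-to-year differences in parameter estimates.

    first_year, second_year: estimates of the same parameters (same order)
    from the two samples.
    """
    # Assumption: difference taken as second year minus first year.
    diffs = [b - a for a, b in zip(first_year, second_year)]
    # Assumption: sample variance (n - 1 denominator).
    return mean(diffs), variance(diffs)


if __name__ == "__main__":
    # Hypothetical item estimates for three items in two years.
    m, v = stability([0.30, 0.35, 0.40], [0.28, 0.36, 0.37])
    print(f"mean difference = {m:.4f}, variance = {v:.5f}")
```

A mean difference near zero with a small variance indicates that the
estimates replicate across samples; a nonzero mean with a small
variance points to a systematic shift between the two populations.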

For the second algorithm (GMP not fixed) the variances vary
considerably across outcomes. The "fixed-GMP" algorithm achieved
uniformly better stability, with the perfect-score criterion noticeably
better than the 80% criterion. The variances for both item $\alpha$
and average $\alpha$ decreased as the difficulty of the objective
increased. Letter names was the easiest objective, word attack the
most difficult. The variances of test $\beta$, on the other hand,
increased as the difficulty of the objective increased. This trend was
apparent in all three sets of calculations for both algorithms. This
result is consistent with the notion that ideally one would like to
estimate $\beta$ from the responses of a group all of whom have
achieved the objective. Likewise, the item alphas could be "best"
estimated from a group none of whom have achieved the objective. When
a mixed group is used, $\beta$ is estimated most accurately when a
high proportion of the group has achieved the objective. Lowering the
GMP of the norm group improves the accuracy of the $\alpha$ estimates
at the expense of $\beta$ accuracy.

Table 1. Stability of Mastery Learning Parameters
(Mean Difference / Variance of Differences)

Outcome              Parameter        Minimum Sum of         80% Criterion          100% Criterion
                                      Squares Solution       Solution               Solution

1. Storybook Words   Item $\alpha$    [illegible]            -.025 / .0191          -.013 / .0076
                     $\bar{\alpha}$   -.081 / .0122          -.026 / .0031          [illegible]
                     $\beta$           .018 / [illegible]    -.002 / .0002          -.004 / .0001

2. Word Elements     Item $\alpha$    [illegible] / .0126    -.041 / [illegible]    [illegible]
                     $\bar{\alpha}$   -.059 / .0033          -.042 / .0015          [illegible]
                     $\beta$          -.003 / .0004          -.007 / .0005          -.006 / .0001

3. Word Attack       Item $\alpha$    [illegible] / .0083    -.032 / .009[?]        -.020 / .0043
                     $\bar{\alpha}$   -.037 / .0011          -.032 / .0017          [illegible]
                     $\beta$          -.000 / .000[?]        -.001 / .0006          -.003 / .0001

4. Letter Names      Item $\alpha$    [illegible] / .0956    -.026 / .0354          -.036 / .00[?]0
                     $\bar{\alpha}$    .052 / .0418          -.026 / [illegible]    [illegible] / .0010
                     $\beta$          -.004 / [illegible]    -.006 / [illegible]    -.004 / .0000

Note: [illegible] and [?] mark entries or digits that cannot be read
in the source copy; in the Word Elements item $\alpha$ row the column
placement of the surviving values is uncertain.

ITEM SELECTION 

If the $\alpha$ and $\beta$ parameters are estimated for a large
sample of items from a domain, using an appropriate norm group, a
small set of highly discriminating items can then be selected for
future mastery-decision applications. The most promising item
discrimination index is

$$\gamma_i = 1 - \alpha_i - \beta_i \tag{11}$$

Items with a high $\gamma$ index provide the most information for the
mastery-decision application.
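
Item selection with the index in equation (11) can be sketched as a
simple ranking. The function name, the item identifiers, and the
parameter values below are hypothetical, and ties are left in their
original order.

```python
def select_items(item_params, n_select):
    """Rank items by the discrimination index of equation (11).

    item_params: dict mapping item id -> (alpha, beta), where alpha is the
                 non-master success rate and beta the master error rate.
    Returns (ids of the n_select most discriminating items, index by item).
    """
    index = {item: 1 - a - b for item, (a, b) in item_params.items()}
    ranked = sorted(index, key=index.get, reverse=True)
    return ranked[:n_select], index


if __name__ == "__main__":
    # Hypothetical estimates for four items from one domain.
    params = {
        "item_1": (0.33, 0.10),
        "item_2": (0.50, 0.05),
        "item_3": (0.20, 0.15),
        "item_4": (0.40, 0.30),
    }
    chosen, index = select_items(params, 2)
    for item in chosen:
        print(f"{item}: index = {index[item]:.2f}")
```

The index is the difference between the probability that a master and
the probability that a non-master answers the item correctly, so it
penalizes both easy guessing and frequent slips.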



SUMMARY 

The use of an independent estimate of the proportion of students
in a norm group that have achieved an objective resulted in
significantly improved stability of mastery learning parameters. This
should result in increased validity of the Mastery Learning Test Model
for making categorical mastery/non-mastery decisions. This test model
can be used to make mastery decisions on the basis of very short
tests. Using the proportion-in-mastery estimate for an instructional
group as a prior probability results in improved estimates of the
probability that an individual student has achieved the objective.
Norm-group data can also be used to select the best set of items from
a domain for the mastery decision application.